
Leaderboard scaffold: standings JSON + Markov reference + verify CLI #4

Closed

protosphinx wants to merge 1 commit into dataset-fetch from leaderboard-scaffold

Conversation

@protosphinx
Member

Stacked on top of #3 (v0.1 fetch). Merge #2 → #3 → this, in order.

Summary

  • First standings file lands: leaderboard/next-event/synthetic-toy.json, with the Markov reference baseline as the inaugural entry (top-1 0.9756, top-3 1.0, n=41); a plausible schema is sketched after this list.
  • pm-bench leaderboard <task> <dataset> [--verify] pretty-prints the table and, with --verify, re-runs scoring on the checked-in predictions to catch drift.
  • Reference predictions are in-repo (leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz) so the loop is reproducible offline.
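
For concreteness, here is one plausible shape for synthetic-toy.json. Only the metric values, the entry name, and the predictions path are confirmed by this PR; the field names and nesting are illustrative guesses, not the actual schema:

```json
{
  "task": "next-event",
  "dataset": "synthetic-toy",
  "metrics": ["top1", "top3"],
  "entries": [
    {
      "name": "markov-ref",
      "top1": 0.9756,
      "top3": 1.0,
      "n": 41,
      "predictions": "leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz"
    }
  ]
}
```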

What's new

  • pm_bench/leaderboard.py: load_board, rescore, verify, standings. Pure CPython, reads gzipped or plain CSV. Truth dispatch is keyed on dataset name; today only synthetic-toy is wired (the dispatch grows a branch per pinned dataset). A sketch of the verify path follows this list.
  • CLI: pm-bench leaderboard <task> <dataset> prints standings; --verify fails non-zero if recorded scores don't match a fresh rescore.
  • leaderboard/README.md — submission convention; how to verify locally.
  • tests/test_leaderboard.py — 8 tests, including a drift canary that tampers with top1 in a tmp copy of the JSON and asserts verify flags it.
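
A minimal sketch of how these pieces could fit together, assuming the JSON shape above and an illustrative predictions-CSV layout (case_id plus three ranked guesses). Only the names rescore and verify come from the PR; every helper, column name, signature, and tolerance below is an assumption, not the actual implementation:

```python
import csv
import gzip
import json
from pathlib import Path


def _open_csv(path: Path):
    """Open a gzipped or plain CSV transparently (the module accepts both)."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", newline="")
    return open(path, newline="")


def rescore(pred_path: Path, truth: dict) -> dict:
    """Recompute top-1/top-3 accuracy from a predictions file.

    Assumed layout: case_id, pred1, pred2, pred3 (ranked guesses);
    the real columns in markov-ref.csv.gz may differ.
    """
    hits1 = hits3 = n = 0
    with _open_csv(pred_path) as fh:
        for row in csv.DictReader(fh):
            ranked = [row["pred1"], row["pred2"], row["pred3"]]
            actual = truth[row["case_id"]]
            n += 1
            hits1 += ranked[0] == actual
            hits3 += actual in ranked
    return {"top1": hits1 / n, "top3": hits3 / n, "n": n}


def verify(board_path: Path, truth: dict) -> bool:
    """Re-score each entry's checked-in predictions; flag drift
    against the recorded numbers."""
    board = json.loads(board_path.read_text())
    ok = True
    for entry in board["entries"]:  # assumed schema, per the JSON sketch above
        fresh = rescore(Path(entry["predictions"]), truth)
        for metric in ("top1", "top3"):
            if abs(fresh[metric] - entry[metric]) > 1e-4:
                print(f"drift in {entry['name']}: {metric} "
                      f"recorded={entry[metric]} actual={fresh[metric]:.4f}")
                ok = False
    return ok
```

The CLI's --verify mode would then exit non-zero whenever verify() returns False, which is what lets a future CI job gate PRs on it.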

Why this matters

  • Locks the leaderboard JSON schema before any external submission lands.
  • Makes the Markov number the explicit floor on the leaderboard, not just a number in a README.
  • Sets up v0.4's CI workflow as a one-step follow-up: it just runs pm-bench leaderboard --verify on the changed files.

Smoke

$ pm-bench leaderboard next-event synthetic-toy --verify
verified 1 entr(ies) — no drift
next-event · synthetic-toy · top1 / top3 accuracy
----------------------------------------
markov-ref  top1=0.9756  top3=1.0000  n=41

Test plan

Roadmap impact

  • README v0.4 milestone marked 🟡 (scaffold + verify shipped; CI workflow on PRs is the remaining piece).

Commit message:

- leaderboard/next-event/synthetic-toy.json — first standings file,
  with the Markov-ref entry (top1 0.9756, top3 1.0, n 41)
- leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz —
  reference predictions, checked in so the loop is reproducible
  without hitting the network
- pm_bench/leaderboard.py — load_board, rescore, verify, standings.
  Reads gzipped or plain CSV; pure CPython (no torch / pandas)
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` —
  pretty-prints standings, optionally re-runs scoring against the
  checked-in predictions and fails if recorded != actual
- tests/test_leaderboard.py — 8 tests including a drift-detection
  canary that tampers with the recorded score and confirms verify()
  flags it (sketched after this list)
- 45 tests total (was 37); ruff clean
- README v0.4 milestone marked partial; STATUS + GOALS updated
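
The drift canary could look roughly like this. The test name, the synthetic_toy_truth fixture, and verify's two-argument signature (matching the sketch above) are all assumptions; only the tamper-a-tmp-copy behavior is stated in the PR:

```python
import json
import shutil

from pm_bench.leaderboard import verify  # real module; 2-arg signature is this sketch's assumption


def test_verify_flags_tampered_top1(tmp_path, synthetic_toy_truth):
    """Drift canary per the PR: tamper with top1 in a tmp copy of the
    standings JSON and assert verify() reports the mismatch."""
    board = tmp_path / "synthetic-toy.json"
    shutil.copy("leaderboard/next-event/synthetic-toy.json", board)

    data = json.loads(board.read_text())
    data["entries"][0]["top1"] = 0.5  # corrupt the recorded score
    board.write_text(json.dumps(data))

    # synthetic_toy_truth: assumed pytest fixture yielding the ground-truth map
    assert verify(board, synthetic_toy_truth) is False
```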
@protosphinx
Member Author

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

@protosphinx protosphinx deleted the branch dataset-fetch May 1, 2026 17:54
@protosphinx protosphinx closed this May 1, 2026
@protosphinx protosphinx deleted the leaderboard-scaffold branch May 1, 2026 17:54